Medical insurance prediction using regression models

by Kenny William Nyallau

Main objective of analysis

The main objective of this analysis is to use regression-only models to predict the medical insurance costs incurred by hospital patients, using their electronic health records. We will first perform a typical exploratory data analysis, then data pre-processing, and finally model training and evaluation. Since the focus is on building, evaluating and explaining the model, the exploratory data analysis will be brief.

The dataset that we will be using is provided by Brett Lantz via his GitHub repository.

Brief description of dataset and summary of its attributes

The hypothetical electronic health record attributes are the following: age, sex, bmi (body mass index), children (number of dependents), smoker, region, and charges (the target variable: the individual medical costs billed to the patient).

Exploratory Data Analysis

The data is immaculate as we have no missing values.

As we can see, the charges for the patients increase roughly linearly with the patients' age. Also, charges for male patients are generally higher than for female patients.

We can observe that the "charges" distribution is right-skewed. We can apply either a Box-Cox transformation or a log transformation in order to normalize the distribution.
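As a sketch of what these two transformations look like in practice (on synthetic right-skewed values standing in for the actual "charges" column, which lives in the project's DataFrame):

```python
import numpy as np
from scipy import stats

# hypothetical right-skewed values standing in for the "charges" column
charges = np.random.default_rng(1).lognormal(mean=9.0, sigma=0.9, size=1000)

# option 1: log transformation (valid because charges are strictly positive)
log_charges = np.log(charges)

# option 2: Box-Cox transformation; scipy estimates the optimal lambda for us
boxcox_charges, lam = stats.boxcox(charges)

# the skewness drops towards zero after either transformation
print(stats.skew(charges), stats.skew(log_charges), stats.skew(boxcox_charges))
```

Either option pulls the long right tail in; Box-Cox is more flexible (it tunes a lambda parameter), while the plain log is simpler to invert when interpreting predictions.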

Feature selection

As we can see from the feature importance result, all six features were found to be important for training, and none were discarded.

Multicollinearity

As a general rule of thumb, we need to ensure that our regression model does not suffer from multicollinearity, i.e. that the independent variables are not highly correlated with each other. The most common way to detect multicollinearity is the Variance Inflation Factor (VIF). We will call statsmodels' variance_inflation_factor function and check every feature for multicollinearity.
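The VIF for feature $i$ is $1/(1 - R_i^2)$, where $R_i^2$ comes from regressing feature $i$ on all the other features. A minimal self-contained sketch of that computation (using plain NumPy rather than statsmodels, with synthetic data standing in for our features):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        # regress column i on the remaining columns (with an intercept)
        A = np.column_stack([np.ones(len(others)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# synthetic example: x2 nearly duplicates x1, so both get a high VIF
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))
```

A VIF above 5 (some texts use 10) is the usual warning threshold; the independent feature x3 stays close to 1.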

We can observe that two features (age and bmi) have a VIF higher than 5, which means these variables exhibit high multicollinearity. There are a few ways to deal with multicollinearity, and one of them is to remove one of the offending features from the data. So let's try to remove the feature "age" and see if the high multicollinearity still exists.

That looks good, but now the model will likely suffer in performance and interpretability because we have dropped what seemed to be an important feature. To avoid dropping features, we can use one-hot encoding to fix this instead.
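A minimal sketch of the encoding step, assuming pandas and using toy records in place of the real dataset (the column values here are illustrative):

```python
import pandas as pd

# toy records with the dataset's categorical features
df = pd.DataFrame({
    "age": [19, 33, 28, 45],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# drop_first=True keeps k-1 dummy columns per category, avoiding the
# perfect collinearity (the "dummy variable trap") a full dummy set creates
encoded = pd.get_dummies(df, columns=["smoker", "region"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per category matters here: a full set of dummies always sums to 1, which is itself a source of multicollinearity.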

As you can see, we have effectively removed the multicollinearity without removing any features from our dataset.

Model training and evaluation

As mentioned above, we are going to perform the prediction task using regression models rather than other types of models such as clustering or decision trees. We will start with the base model, Linear regression, followed by Ridge regression, Lasso regression and finally Polynomial regression, and compare their performance using metrics such as R-squared and Mean Squared Error.

Linear Regression

Linear regression is a supervised learning algorithm that describes the relationship between one or more independent variables X and a dependent variable Y in order to predict a continuous numerical outcome using a best-fit straight line.

A simple linear regression can be described with the following formula:

$y = {\beta} _{0} + {\beta}_{1}{x}$

Whereby:

$y$ is the target output

$x$ is the feature

${\beta} _{0}$ is the intercept

${\beta} _{1}$ is the coefficient of $x$

When we are dealing with multiple features then we can simply use multiple regression which is just another extension of linear regression:

$y = {\beta} _{0} + {\beta}_{1}{x}_{1} + {...} + {\beta}_{n}{x}_{n}$

where ${n}$ is the number of features
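To make the notation concrete, here is a small sketch that fits a multiple regression on synthetic data and recovers the intercept ${\beta}_{0}$ and coefficients ${\beta}_{1}, {\beta}_{2}$ (scikit-learn assumed, since it is the usual tool for the models compared later):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data generated from y = 3 + 2*x1 - 1*x2 plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
# intercept_ estimates beta_0; coef_ estimates (beta_1, beta_2)
print(model.intercept_, model.coef_)
```

With low noise, the estimates land very close to the true parameters (3, 2, -1), which is exactly what the formula above promises.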

Model evaluation

For regression tasks, we will typically use the following metrics:

  1. Mean Absolute Error (MAE):

    $\frac{1}{n} \sum_{i = 1}^{n}{|{y}_{i} - \hat{y}_{i}|}$

  2. Mean Squared Error (MSE):

    $\frac{1}{n} \sum_{i = 1}^{n}{({y}_{i} - \hat{y}_{i})^2}$

  3. Root Mean Squared Error (RMSE):

    $\sqrt{\frac{1}{n} \sum_{i = 1}^{n}{({y}_{i} - \hat{y}_{i})^2}}$

We will apply those evaluation metrics to the rest of the regression models.
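These three metrics translate directly from the formulas above; a minimal NumPy sketch with toy values:

```python
import numpy as np

def mae(y_true, y_pred):
    # mean of the absolute residuals
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    # mean of the squared residuals
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    # square root of MSE, back in the units of the target
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

RMSE is often preferred for reporting because, unlike MSE, it is expressed in the same units as the target (here, dollars of charges).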

Ridge Regression

Ridge regression is a model tuning method used to analyse multiple regression data that suffers from high multicollinearity. It adds an L2 penalty to the loss function, which shrinks the coefficients towards zero without eliminating them.
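A small illustrative sketch of that behaviour, comparing ordinary least squares against Ridge on two deliberately collinear synthetic features (the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# two nearly identical (highly collinear) features
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=100)])
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# the L2 penalty keeps the ridge coefficients small and stable, where
# OLS can assign large coefficients of opposite sign to the duplicates
print(ols.coef_, ridge.coef_)
```

The individual ridge coefficients stabilise while their sum still approximates the true combined effect, which is why ridge is the standard remedy when multicollinearity cannot be removed from the data.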

Lasso Regression

Lasso, or Least Absolute Shrinkage and Selection Operator, serves as a method for both feature selection and model regularization. It regularizes the model parameters by shrinking the coefficients, reducing some of them exactly to zero.
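A quick sketch of that shrinkage-to-zero behaviour on synthetic data, where only two of five features actually matter (the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

# y depends only on the first two of five features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
# the L1 penalty drives the coefficients of the irrelevant
# features to exactly zero, performing feature selection
print(lasso.coef_)
```

This exact-zero property is what makes lasso usable for feature selection, as in the feature-selection step earlier; ridge, by contrast, only shrinks coefficients without zeroing them.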

Polynomial Regression

If the relationship between the variables is not linear but the data is still correlated, then Linear regression might not be the best solution. In this case, Polynomial regression can be used to minimize the error or cost function by fitting a polynomial curve. However, it may be sensitive to outliers.
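A minimal sketch contrasting a straight-line fit with a degree-2 polynomial fit on synthetic quadratic data (scikit-learn's PolynomialFeatures assumed; the degree is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# quadratic relationship that a straight line cannot capture
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 0.5 * x.ravel() ** 2 + rng.normal(scale=0.1, size=200)

linear = LinearRegression().fit(x, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

# R-squared for each fit: the degree-2 model explains far more variance
print(linear.score(x, y), poly.score(x, y))
```

Polynomial regression is still a linear model in its (expanded) features, so it trains with the same least-squares machinery; the pipeline simply generates the $x^2$ column before fitting.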

Summary of data cleaning and feature engineering

  1. We transformed the categorical features "smoker" and "region" into numerical ones using one-hot encoding.
  2. The dataset is clean: there are no missing values, so we did not perform any kind of data imputation.
  3. We preprocessed the feature "age" into a categorical group called "age_category" for further exploratory data analysis.
  4. We utilized Lasso regression to find the most significant features for model training.
  5. We normalized the "charges" distribution using both a Box-Cox transformation and a log transformation. For the final model, we used the log transformation of the "charges" feature.
  6. Our dataset suffered from high multicollinearity, which we detected using the variance inflation factor.

Summary of model training

  1. We used all of the features (except our target variable, "charges") and initialized the random state to 42 to keep the train/test split consistent across runs.
  2. In this experiment, we only used Linear regression (the base model), Ridge, Lasso and Polynomial regression.
  3. We evaluated the models using the R-squared score and mean squared error.

Summary of Key findings and insights

  1. We found that Polynomial regression has the highest R-squared score at 0.83, compared to Linear regression (0.78), Ridge (0.78) and Lasso (0.65). That means that, with Polynomial regression, 83% of the variability in the dependent variable can be explained by the model.
  2. The dataset itself suffered from high multicollinearity, and we rectified the issue by converting the categorical values to dummy variables.
  3. Although we achieved relatively good performance with the Polynomial model, the dataset itself is too simple and clean (real-world data will be noisy), so we cannot treat the regression models' performance here as a real measure of success in a real-world application. It is always important to note how the structure of the dataset (i.e. the number of features and observations) may influence the model's performance as well. Since regression problems tend to be sensitive to multicollinearity, it is recommended to experiment with other types of predictive models. As a simple exercise in regression, however, the quality of this dataset suffices.

Next steps of data analysis

  1. Use different types of models to try to improve the prediction even further, such as K-Nearest Neighbours, Support Vector Machines, ensemble learning and decision trees.
  2. Find a more exhaustive dataset to work with in order to fully evaluate the strength of regression models. Such exhaustive datasets can be difficult to find because patient electronic health records raise privacy concerns, so only limited public records are available for educational purposes.